ABSTRACT


This paper addresses the problem of screening potential
variables for entrance in a linear multiple regression
setting. The purpose of the work presented here is to propose
two screening methods, both of which have roots in principal
component analysis, and which evaluate a combination of
variables in an efficient enough manner so that enumeration
of all combinations is feasible even when the number of
potential variables is quite large. Using the square of the
multiple correlation coefficient as the criterion, the
selections made by these methods in several test cases were
evaluated, and compared with the selections made by the
methods of total enumeration and stepwise regression. The
paper concludes with overall evaluations of the two methods
and suggests directions for further study.


I. INTRODUCTION


The search for the important variables in a multiple 
linear regression setting is, due to its great practical 
importance, an area which has received considerable atten- 
tion. Researchers often use a large number of potentially 
important variables in the exploratory stages of their work. 
These need to be screened in order to determine the most
parsimonious subset of these variables available for predicting
or controlling the response with an acceptable level of error.
It appears that a total enumeration of all combinations
of variables [Ref. 1] is necessary to be assured of
making the best selection. This process is computationally
overwhelming when the number of variables becomes even
moderately large. (Each combination requires a matrix
inversion, and there are $2^p$ such inversions to perform if
$p$ is the number of variables.)

A highly popular approach to this problem is the use of 
stepwise regression [Refs. 2, 3, and 4]. It is basically
a one-step look-ahead method, and uses significance tests
based on distributional assumptions to judge the combinations
of variables under consideration at each step. It appears
to do an adequate job of selection, and is readily available
in packaged form (specifically BMD and SPSS).

The purpose of the present work is to present and study
two alternatives to stepwise regression in hopes of finding
viable competitors. It is desirable that these methods
should perform at least as well as stepwise regression, yet
remain feasible computationally. In this regard the method
of principal component analysis of the antecedent variables
is useful and serves as a guide.

This paper will pursue the development of these
alternatives in the following manner. A discussion of the general
problem will be presented first, along with remarks
concerning ways of measuring the effectiveness of the combinations
of variables. Comments concerning several suggested
screening methods will follow, followed in turn by a
discussion of stepwise regression. A development of the alternatives
under consideration in this paper will be presented, and a
comparative evaluation of their performance on some test
cases will conclude the work.





II. THE LINEAR REGRESSION MODEL


In order to introduce the linear model, it is convenient
to agree to a common notation. Let $y$ be the dependent, or
response, variable, and $x_1, \ldots, x_p$ represent the $p$
antecedent variables. Assume that $N$ sets
$(y_i, x_{i1}, \ldots, x_{ip})$ are observed, and for
convenience, let each member of the set be replaced with its
deviation from the sample mean:

$$y_i \leftarrow (y_i - \bar{y}) \quad \text{and} \quad x_{ij} \leftarrow (x_{ij} - \bar{x}_j). \qquad (2.1)$$

The response $y$ is, of course, viewed as random. The
antecedent variables may be either random or deterministic;
it does not matter which. We are concerned only with the
question of which subset of them should be permanently
collected and not with formal statistical inference per se.
When means, variances, covariances, and correlations are
introduced, they refer only to the sample quantities. Further,
no distributional assumptions are made about $y$. Thus
decisions regarding the appropriateness of the various
combinations of variables are structured on ad hoc data analysis
grounds and not on formal tests.
Using the column vector $Y$ to denote the $\{y_i\}_1^N$ and the
matrix $X$ for the $N$ sets of $(x_{i1}, \ldots, x_{ip})$, the
usual linear model

$$\underset{N \times 1}{Y} = \underset{N \times p}{X}\,\underset{p \times 1}{\beta} + \underset{N \times 1}{\epsilon} \qquad (2.2)$$

is assumed, where $\beta$ represents the vector of regression
coefficients and $\epsilon$ the vector of residuals. The
estimation of $\beta$ is achieved by the method of least
squares, the solution being [Refs. 2 and 5]:

$$\hat{\beta} = (X'X)^{-1} X'Y \qquad (2.3)$$



assuming $X$ has full rank (as it generally will in data
analysis situations) and the prime denotes transpose. The
correlation matrix of the $x_1, \ldots, x_p$ variables is
introduced as

$$C = \frac{1}{N} X'X \qquad (2.4)$$

The computational problem associated with total
enumeration can now be made more explicit. Consider a subset of
size $q$, $(x_{i_1}, \ldots, x_{i_q})$ of $(x_1, \ldots, x_p)$.
There are $\binom{p}{q}$ such subsets, and for each of these a
covariance matrix (a minor of (2.4)) must be inverted to
produce the corresponding $\hat{\beta}$ (Eqn. (2.3)). This
done, one must choose the best (in some sense) subset of size
$q$ and do this for each $q = 1, \ldots, p$.
A number of criteria for judging the fit of a subset
of variables are available, including multiple correlation,
standardized total squared error [Ref. 6], and variance of
residuals. The approach taken here is to chart the growth
of the square of the multiple correlation coefficient, $R^2$,
as a function of $q$. This can be done for total enumeration
(this is the subset $(x_{i_1}, \ldots, x_{i_q})$ that maximizes
$R^2$ for each $q$), for stepwise regression, and the two
alternatives to be introduced. Once such charts are made, the
user may choose $q$ to meet his own needs. The researcher
hopes that $R^2$ grows very rapidly and becomes very close to
its maximum (achieved when $q = p$) for small $q$. The worst
case occurs when $R^2$ grows linearly with $q$.

This approach seems simple and reasonable. No formal
significance tests are made and the attendant difficulties
connected with simultaneous inference are not addressed.
Although the method of stepwise regression uses formal
significance testing in its intermediate stages, the results
of using it can still be compared using our simple ad hoc
approach.

Remark: In recent work with the method of ridge
regression [Ref. 7] the use of unbiased estimators Eqn. (2.3) is
foregone and more general measures based on average squared
error are used. In this method the principal diagonal of
the covariance matrix Eqn. (2.4) is loaded in an effort to
trade bias against a smaller mean squared error. Comparison
of the methods presented here with those of ridge regression
is not considered in this thesis.

To present computational formulae for the measures of
effectiveness discussed previously, additional notation is
convenient. Let $s$ be the $p \times 1$ column vector of
covariances between $y$ and $(x_1, \ldots, x_p)$, and let
$c_{yy}$ be the variance of $y$. The variance of residuals can
be estimated as

$$\hat{\sigma}^2 = c_{yy} - s'C^{-1}s = \frac{1}{N}(Y - X\hat{\beta})'(Y - X\hat{\beta}) \qquad (2.5)$$

where

$$s = \frac{1}{N} X'Y \qquad (2.6)$$

and $Y - X\hat{\beta}$ is the estimate of residuals. The square
of the multiple correlation coefficient is [Ref. 5]

$$R^2 = \frac{s'C^{-1}s}{c_{yy}} \qquad (2.7)$$

and $C$ is given in Eqn. (2.4). When the model is reduced to
$(x_{i_1}, \ldots, x_{i_q})$, the matrices and vectors must be
modified accordingly.
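
As a concrete illustration of (2.7) and of total enumeration,
the following minimal Python sketch computes $R^2$ for every
subset of a small problem. It assumes the correlations printed
in Table IV for test matrix one (so $c_{yy} = 1$). The function
names `solve`, `r_squared`, and `best_subsets` are illustrative
choices and are not taken from the thesis or its FORTRAN program.

```python
from itertools import combinations


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [list(a[i]) + [b[i]] for i in range(n)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def r_squared(C, s, c_yy, subset):
    """R^2 = s'C^{-1}s / c_yy, restricted to the variables in `subset`."""
    C_sub = [[C[j][k] for k in subset] for j in subset]
    s_sub = [s[j] for j in subset]
    beta = solve(C_sub, s_sub)  # beta = C^{-1} s for the reduced model
    return sum(b * t for b, t in zip(beta, s_sub)) / c_yy


def best_subsets(C, s, c_yy):
    """Total enumeration: the R^2-maximizing subset for each size q."""
    p = len(s)
    best = {}
    for q in range(1, p + 1):
        best[q] = max((r_squared(C, s, c_yy, S), S)
                      for S in combinations(range(p), q))
    return best


# Assumed input: test matrix one, read from the Table IV printout.
C = [[1.0, 0.5, 0.5],
     [0.5, 1.0, 0.1],
     [0.5, 0.1, 1.0]]
s = [0.6, 0.5, 0.5]
c_yy = 1.0

for q, (r2, S) in best_subsets(C, s, c_yy).items():
    print(q, ["x%d" % (j + 1) for j in S], round(r2, 5))
```

For this matrix the sketch selects $x_1$ alone, then the pair
$x_2, x_3$, then the full set, with $R^2$ values that agree
with the R SQUARE entries 0.36000, 0.45455, and 0.49333 of
Table IV.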

Three test cases are introduced to evaluate the methods
under consideration (Tables I-III). There the pertinent
quantities $C$, $s$, and $c_{yy}$ are presented as augmented
matrices in the format

$$\begin{bmatrix} c_{yy} & s' \\ s & C \end{bmatrix}$$


The first test case was generated specifically to expose
a weakness of the stepwise approach, as will be seen. The
second matrix was designed to be of sufficient complexity to
give a more enlightening comparison of the methods under
consideration, yet small enough so that the total enumeration
solution could be obtained. Care was taken to ensure that
the criterion of positive definiteness (see Appendix B) was
met.

Because the first two examples are artificial in nature,
an application using real data was sought. Such was
obtained from a study currently underway [Ref. 8]. There, a
survey concerning subjective reactions to a set of fourteen
drugs is being made to ascertain how the subjects perceive
the various drugs. Each drug is rated from one to seven on
each of the following fourteen scales: violence, growth,
slowness, destruction, enhancement, activity, goodness,
avoidance, integration, positivity, permanence, speed,
severity, and strength.

One of the studies within this investigation is to de- 
scribe the scale "severity" in terms of the other scales. 
More specifically, what subset of the other scales "best" 
describes severity? This problem presents an opportunity 
to compare the screening methods being presented here with 
stepwise regression, and provides a third test matrix. 
This data is rounded to one digit to conserve space, and 
in this form may not be positive definite. See Table III.
Of course, the original matrix was used in the computations.





Table I
TEST MATRIX ONE

Table II
TEST MATRIX TWO

[Entries of Tables I and II are not legible in the source.]

Table III
TEST MATRIX THREE
(Entries rounded; see Ref. 8 for usable data)


III. CURRENTLY USED SCREENING PROCEDURES


There are a number of methods currently used to screen 
variables, among them forward selection, backwards elimina- 
tion, stepwise regression, and several graphical techniques 
[Ref. 6]. Because stepwise regression incorporates the best 
of two of the above methods, enjoys general acceptance, and 
is readily available as a packaged program, it will serve 
as a baseline for measuring the performance of the methods 
under study here, and will be discussed in greater detail. 

Stepwise regression is based on an underlying assumption
of normality for the response variable $y$. As a result of
this, sums of squares from several sources (including
residual error, regression, and reduced model) have chi-square
distributions. Thus at any step, a potential new variable
may be tested for its significance if allowed to enter the
system. Among those eligible to enter, the most significant
(in terms of the F statistics being formed) is selected.
Then those variables entered at previous steps are tested
to determine whether their presence remains significant.
Again, F statistics are used as the criteria for deletion.
When no variables can either enter or leave, the process
terminates. Stepwise regression has the ability to look
ahead only one variable at a time, and thus cannot guarantee
an optimum solution.
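
The one-variable look-ahead can be illustrated with a
deliberately simplified Python sketch. It is an assumption-laden
stand-in, not the BMD or SPSS procedure: it performs forward
selection on the gain in $R^2$ only, with no F tests and no
deletion step. The input is the correlation matrix printed in
Table IV for test matrix one, and the names `solve`,
`r_squared`, and `forward_selection` are hypothetical.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def r_squared(C, s, c_yy, subset):
    """R^2 = s'C^{-1}s / c_yy for the variables listed in `subset`."""
    C_sub = [[C[j][k] for k in subset] for j in subset]
    s_sub = [s[j] for j in subset]
    beta = solve(C_sub, s_sub)
    return sum(b * t for b, t in zip(beta, s_sub)) / c_yy


def forward_selection(C, s, c_yy):
    """Greedy selection: at each step add the single variable that
    gives the largest R^2 when joined to those already chosen."""
    chosen, path = [], []
    remaining = list(range(len(s)))
    while remaining:
        best_j = max(remaining,
                     key=lambda j: r_squared(C, s, c_yy, chosen + [j]))
        chosen.append(best_j)
        remaining.remove(best_j)
        path.append((list(chosen), r_squared(C, s, c_yy, chosen)))
    return path


# Assumed input: test matrix one, read from the Table IV printout.
C = [[1.0, 0.5, 0.5],
     [0.5, 1.0, 0.1],
     [0.5, 0.1, 1.0]]
s = [0.6, 0.5, 0.5]

for subset, r2 in forward_selection(C, s, 1.0):
    print(["x%d" % (j + 1) for j in subset], round(r2, 5))
```

Because $x_1$ is taken first, every pair the greedy search can
reach contains $x_1$ (here $x_2$ and $x_3$ tie as the second
choice), so the pair $x_2, x_3$ is never examined even though
its $R^2$ is larger.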

One of the more curious characteristics of stepwise
regression as it is used in the packaged programs available
(specifically the BMD and SPSS regression packages) is the
set of critical points for the F-statistic used by the
computer as criteria for entrance of variables. In order to
reduce core requirements these packaged programs allow a
single value from the F table to be used as the criterion,
despite the change in degrees of freedom required each time
a new F statistic is formed. Thus what may be a test of
significance at the level $\alpha$ for one F-statistic with
$p$ and $q$ degrees of freedom will not remain so for a new
F-statistic with $r$ and $s$ degrees of freedom. As a result,
there is no easy way to control the actual level of
significance of each test performed (much less the overall
level of significance of the final combination chosen with the
multiple tests). Further, the default values (for entering a
variable) in both the SPSS and BMD regression packages are set
at [illegible]. While these may be changed by the user, many
users are unaware of the problem and rely on the default
value. This default value is not one which most users would
initially choose, as for most F tests this would correspond
to a significance level close to one. (In fairness to
stepwise regression it should be stated that the problem is with
the packaged programs, not with the method itself.)

Two advantages to stepwise regression are worth noting.
The first is that it only needs to look at a small subset of
the total number of combinations before terminating, and
therefore relatively large problems become computationally
feasible. The second is that when the underlying assumptions
are met the results of its screening seem to be generally
quite good.





There are a number of disadvantages to stepwise
regression. The first is that it is extremely difficult to "see
into" the method and understand what it is doing at any step.
The computer printout is confusing, and many users may be
only dimly aware of what is happening. Consequently, the
results obtained may often be unsatisfactory, as the user
delegates too much analysis to the computer program.

A second disadvantage is that of the underlying
distributional assumptions. Robustness to departures from the
normality assumption becomes an issue, as well as the
previously stated problem of simultaneous inference.


IV. SCREENING OF VARIABLES


The method of principal components [Refs. 2 and 5]
applied to the antecedent variables provides a platform for
introducing the screening methods presented in this paper.
Further, the growth of the multiple correlation coefficient
when the principal components are used as antecedent
variables is easily developed and serves as a rough standard
for the kind of growth that may be available using the
original variables.

The mathematical structure is useful in that the
component variables are orthogonal (uncorrelated), and their
variances are readily obtainable, as are their correlations
with the response variable. The matrix of eigenvectors
serves to expose those antecedent variables which exert
most influence over the orientation of the original data.
Thus it seems reasonable that useful screening methods can
be obtained by first finding the important principal
components, and then the important original variables that
influence these components.

General developments of the method of principal components
are readily available (see for example [Refs. 2 and 5]). The
salient properties used herein are presented below and, for
sake of immediate reference, developed in Appendix A.

The rotation of the vector of antecedent variables $x$ to
the principal components vector $v$ may be expressed as follows:

$$v = W'x \qquad (4.1)$$


where the columns of $W$, $(w_1, w_2, \ldots, w_p)$, are the
eigenvectors of $C$ in Eqn. (2.4). The covariance matrix of $v$
is diagonal with the variances being the eigenvalues
$\lambda_j$. The covariance of a typical $v_j$ with the
response $y$ can be calculated as

$$\mathrm{Cov}(y, v_j) = E(y v_j) = w_j' E(yx) = w_j' s \qquad (4.2)$$

and hence the correlations are

$$r_{y,v_j} = \frac{w_j' s}{\sqrt{c_{yy}\lambda_j}} = \frac{1}{\sqrt{c_{yy}\lambda_j}} \sum_{i=1}^{p} w_{ij} s_i \qquad (4.3)$$

where $w_{ij}$ are the elements of $W$ and $s_i$ are the
elements of $s$.

When the principal components are considered as being
the antecedent variables, it is an easy task to chart the
square of the multiple correlation coefficient $R^2$ as a
function of the number of variables $q$. One need only order
the $v_1, \ldots, v_p$ according to the magnitudes of their
correlations with $y$ (given by Eqn. (4.3)):

$$r^2_{y,v_1} \ge r^2_{y,v_2} \ge \cdots \ge r^2_{y,v_p}. \qquad (4.4)$$

Then

$$R^2_{pc} = \sum_{i=1}^{q} r^2_{y,v_i} = \sum_{i=1}^{q} \frac{1}{c_{yy}\lambda_i} \Big[\sum_{j=1}^{p} w_{ji} s_j\Big]^2 = \frac{1}{c_{yy}} \sum_{i=1}^{q} \frac{1}{\lambda_i} \Big[\sum_{j=1}^{p} w_{ji} s_j\Big]^2 \qquad (4.5)$$

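
The principal component curve of (4.3) through (4.5) can be
sketched in a few lines of Python. This is an illustration
under stated assumptions rather than the thesis program: the
eigenvalues and eigenvectors are obtained with a textbook
cyclic Jacobi rotation (a swapped-in technique chosen only to
keep the example self-contained), the input is the test matrix
one correlation matrix as printed in Table IV, and the names
`jacobi_eigen` and `pc_r2_curve` are hypothetical.

```python
import math


def jacobi_eigen(C, sweeps=50, tol=1e-24):
    """Eigenvalues and eigenvectors (columns of W) of a symmetric
    matrix, computed by cyclic Jacobi rotations."""
    n = len(C)
    a = [row[:] for row in C]
    W = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                if a[p][p] == a[q][q]:
                    theta = math.pi / 4
                else:
                    theta = 0.5 * math.atan(2 * a[p][q] / (a[q][q] - a[p][p]))
                c, sn = math.cos(theta), math.sin(theta)
                for k in range(n):  # a <- a J
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - sn * akq, sn * akp + c * akq
                for k in range(n):  # a <- J' a
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - sn * aqk, sn * apk + c * aqk
                for k in range(n):  # W <- W J accumulates the rotations
                    wkp, wkq = W[k][p], W[k][q]
                    W[k][p], W[k][q] = c * wkp - sn * wkq, sn * wkp + c * wkq
    return [a[i][i] for i in range(n)], W


def pc_r2_curve(C, s, c_yy):
    """Cumulative R^2 when the best q principal components are used."""
    lam, W = jacobi_eigen(C)
    p = len(s)
    r2 = []
    for i in range(p):
        cov = sum(W[j][i] * s[j] for j in range(p))  # Cov(y, v_i), Eqn. (4.2)
        r2.append(cov * cov / (c_yy * lam[i]))       # squared Eqn. (4.3)
    r2.sort(reverse=True)                            # ordering of Eqn. (4.4)
    curve, total = [], 0.0
    for value in r2:
        total += value
        curve.append(total)                          # partial sums, Eqn. (4.5)
    return curve


# Assumed input: test matrix one, read from the Table IV printout.
C = [[1.0, 0.5, 0.5],
     [0.5, 1.0, 0.1],
     [0.5, 0.1, 1.0]]
s = [0.6, 0.5, 0.5]

print([round(v, 5) for v in pc_r2_curve(C, s, 1.0)])
```

At $q = p$ the last entry of the curve equals the ordinary
$R^2$ of the full set, since the sum in (4.5) then reduces to
$s'C^{-1}s / c_{yy}$.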





A. FIRST SCREENING METHOD (M1) 

Our goal is to select the best subset of size $q$ from
$(x_1, \ldots, x_p)$, where maximization of $R^2$ is the
criterion. The best $q$ principal components are already in
hand from (4.4). It seems reasonable to try to match this
vector with the "closest" $q$-dimensional flat in $x$-space.
First we will introduce some notation: the new rotation to the
subset of $q$ principal components may be denoted

$$\begin{bmatrix} v_1 \\ \vdots \\ v_q \end{bmatrix} = \begin{bmatrix} w_{11} & \cdots & w_{p1} \\ \vdots & & \vdots \\ w_{1q} & \cdots & w_{pq} \end{bmatrix} \begin{bmatrix} x_1 \\ \vdots \\ x_p \end{bmatrix} \qquad (4.6)$$

That is,

$$\underset{q \times 1}{v^*} = \underset{q \times p}{W^{*\prime}}\,\underset{p \times 1}{x} \qquad (4.7)$$

Note that $W^*$ consists of $q$ eigenvectors, each of which is
complete (that is, the deletion is among the eigenvectors,
not across them).

We now have $q$ of the principal components represented
as linear combinations of all $p$ of the $\{x_i\}$. The second
step is to reduce the number of $x_i$ used. We note that the
act of dropping some of the $x_i$ may be viewed as the removal
of the corresponding columns of $W^{*\prime}$ and their
replacement with columns of zeroes. Thus

$$\underset{q \times 1}{v^{**}} = \underset{q \times p}{W^{**\prime}}\,\underset{p \times 1}{x} \qquad (4.8)$$

Finding the "closest" subset mentioned above will be
accomplished by

$$\min E\,\|v^* - v^{**}\|^2 \qquad (4.9)$$

where the operator $E$ refers to averaging. Such a
minimization must take place for each $q = 1, \ldots, p$. The
solution to this problem will be referred to as method M1.

In terms of the elementary quantities, (4.9) may be
written as

$$E\,\|v^* - v^{**}\|^2 = E\,\|(W^{*\prime} - W^{**\prime})x\|^2 = \sum_{i=1}^{q} \sum_{j=1}^{p} \sum_{k=1}^{p} (W^{*\prime} - W^{**\prime})_{ij} (W^{*\prime} - W^{**\prime})_{ik}\, c_{jk} \qquad (4.10)$$

In order to describe the minimization process of (4.9),
let $S$ be a set of subscripts of the variables considered
for inclusion in the regression, so that its complement,
$\bar{S}$, contains subscripts of those variables to be
excluded. Then (4.10) becomes

$$E\,\|v^* - v^{**}\|^2 = \sum_{i=1}^{q} \sum_{j \in \bar{S}} \sum_{k \in \bar{S}} w_{ji} w_{ki}\, c_{jk} \qquad (4.11)$$

Further, let $Q = (Q_1, \ldots, Q_p)$, where

$$Q_i = \begin{cases} 1 & \text{if } i \in S \\ 0 & \text{if } i \in \bar{S} \end{cases} \qquad (4.12)$$

Finally let $Z$ ($p \times p$) be defined

$$(Z)_{jk} = \sum_{i=1}^{q} w_{ji} w_{ki} \qquad (4.13)$$

It follows that (4.10) can be represented as

$$E\,\|v^* - v^{**}\|^2 = \sum_{j=1}^{p} \sum_{k=1}^{p} Z_{jk}\, c_{jk}\, (1 - Q_j)(1 - Q_k). \qquad (4.14)$$

The closest $v^{**}$ may be found by forming all
$\binom{p}{q}$ subsets $S$ of size $q$, then computing (4.14)
for each, and choosing the smallest.

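It is useful to express this algorithm in another way.
From (4.10) we obtain

$$E\,\|v^* - v^{**}\|^2 = \sum_{i=1}^{q} \sum_{j=1}^{p} \sum_{k=1}^{p} (W^{*\prime} - W^{**\prime})_{ij} (W^{*\prime} - W^{**\prime})_{ik}\, c_{jk} = \sum_{j=1}^{p} \sum_{k=1}^{p} \big[(W^{*\prime} - W^{**\prime})'(W^{*\prime} - W^{**\prime})\big]_{jk}\, c_{jk} \qquad (4.15)$$

Since

$$(W^{**\prime})_{ij} = (W^{*\prime})_{ij}\, Q_j \qquad (4.16)$$

we see that

$$(W^{*\prime} - W^{**\prime})_{ij} = \begin{cases} 0 & \text{if } x_j \text{ is to be included} \\ (W^{*\prime})_{ij} & \text{if } x_j \text{ is to be excluded} \end{cases} \qquad (4.17)$$

yielding a form that is easily computerized.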


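B. SECOND SCREENING METHOD (M2)

The rationale for the second method begins with the
observation that when (4.5) is used to calculate $R^2_{max}$
(at $q = p$),

$$R^2_{max} = \frac{1}{c_{yy}} \sum_{i=1}^{p} \frac{1}{\lambda_i} \Big[\sum_{j=1}^{p} w_{ji} s_j\Big]^2 \qquad (4.18)$$

one can decompose the inner summation into two parts; one
associated with the variables to be kept and one with the
variables to be deleted. The former is indexed by the set
$S$ and the latter by its complement $\bar{S}$. Letting

$$t_{ij} = \frac{w_{ji} s_j}{\sqrt{\lambda_i}} \qquad (4.19)$$

we have

$$R^2_{max} = \frac{1}{c_{yy}} \sum_{i=1}^{p} \Big[\sum_{j \in S} t_{ij} + \sum_{j \in \bar{S}} t_{ij}\Big]^2 = R^{*2} + \frac{1}{c_{yy}} \sum_{i=1}^{p} \Big[2 \sum_{j \in S} t_{ij} \sum_{j \in \bar{S}} t_{ij} + \Big(\sum_{j \in \bar{S}} t_{ij}\Big)^2\Big] \qquad (4.20)$$

where $R^{*2} = \frac{1}{c_{yy}} \sum_{i=1}^{p} \big[\sum_{j \in S} t_{ij}\big]^2$.
This serves to define $R^{*2}$ for each set $S$ of indices of
variables to be kept. It seems reasonable that a set $S$ that
maximizes $R^{*2}$ should define a good set of variables to
retain. The process of determining the set $S$ will be referred
to as method M2.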

The maximization problem may be couched nicely in
mathematical programming notation as follows:

$$\text{maximize} \quad R^{*2} = \frac{1}{c_{yy}} \sum_{i=1}^{p} \Big[\sum_{j=1}^{p} t_{ij} Q_j\Big]^2$$

$$\text{subject to} \quad Q_j = 0 \text{ or } 1, \qquad \sum_{j=1}^{p} Q_j = q. \qquad (4.21)$$

The expression also lends itself well to computerization.

As in method M1, we must generate all possible $Q$ vectors
in order to find all possible candidates for (4.21), ultimately
choosing the largest for each value of $q$.

It is interesting that this method can be developed from
another point of view. From (2.7) the square of the
multiple correlation may be expressed as

$$R^2 = \frac{1}{c_{yy}}\, s'C^{-1}s = \frac{1}{c_{yy}}\, \mathrm{Trace}(ss'C^{-1}) \qquad (4.22)$$

In hopes of finding a viable screening method, we again
consider a subset $S$ of subscripts, and limit the above trace
computation to elements of $S$. To be more explicit, let

$$D = \mathrm{Diag}(Q_1, \ldots, Q_p). \qquad (4.23)$$

Then let us consider screening the vector $s$ with $D$ by
looking at the $p \times p$ matrix $Dss'D$. We may think of
this as a matrix which has been stamped with a grid so that
non-zero elements can be found only in those rows and columns
both of whose indices belong to $S$. Comparison of all possible
"stamps" of order $q$ allows us to choose that combination
which maximizes the trace in (4.22). In this notation, (4.22)
becomes [Ref. 9]

$$R^{*2} = \frac{1}{c_{yy}}\, \mathrm{Trace}(Dss'DC^{-1}) = \frac{1}{c_{yy}}\, \mathrm{Trace}(ss'DC^{-1}D). \qquad (4.24)$$


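However, we can show that this is equivalent to method
M2 as follows. Let

$$\underset{p \times p}{\Lambda} = E(vv') = E(W'xx'W) = W'E(xx')W = W'CW = \mathrm{Diag}(\lambda_1, \ldots, \lambda_p) \qquad (4.25)$$

Notice that

$$C = W \Lambda W' \quad \text{and} \quad C^{-1} = W \Lambda^{-1} W' \qquad (4.26)$$

properties that follow from the orthogonality of $W$. It
follows that

$$\mathrm{Trace}(ss'DC^{-1}D) = \mathrm{Trace}(ss'DW\Lambda^{-1}W'D). \qquad (4.27)$$

Since

$$(W \Lambda^{-1} W')_{jl} = \sum_{k=1}^{p} \frac{w_{jk} w_{lk}}{\lambda_k} \qquad (4.28)$$

it follows that

$$(D W \Lambda^{-1} W' D)_{jl} = \sum_{k=1}^{p} \frac{w_{jk} w_{lk}}{\lambda_k}\, Q_j Q_l \qquad (4.29)$$

Since $(ss')_{jl} = s_j s_l$, (4.27) becomes

$$\frac{1}{c_{yy}}\, \mathrm{Trace}[ss'DW\Lambda^{-1}W'D] = \frac{1}{c_{yy}} \sum_{j=1}^{p} \sum_{l=1}^{p} \sum_{k=1}^{p} \frac{s_j s_l w_{jk} w_{lk}}{\lambda_k}\, Q_j Q_l = \frac{1}{c_{yy}} \sum_{k=1}^{p} \sum_{l=1}^{p} \sum_{j=1}^{p} t_{kj} t_{kl}\, Q_j Q_l = \frac{1}{c_{yy}} \sum_{k=1}^{p} \Big[\sum_{j=1}^{p} t_{kj} Q_j\Big]^2 \qquad (4.30)$$

using (4.19). Comparison of this with (4.21) completes the
proof. Hence method M2 may be thought of as an attempt to
approximate the inverse of a $q \times q$ minor of $C$ by a
stamped matrix $DC^{-1}D$.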
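
The trace form (4.24) suggests a direct way to score every
subset after a single inversion of $C$. The sketch below is a
minimal Python illustration, not the FORTRAN program of
Appendix C: it evaluates the stamped trace as a double sum
over the indices in $S$. The input is assumed to be the test
matrix one correlation matrix as printed in Table IV, and the
names `inverse` and `m2_scores` are hypothetical.

```python
from itertools import combinations


def inverse(a):
    """Matrix inverse by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [list(a[i]) + [float(i == j) for j in range(n)] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        d = m[col][col]
        m[col] = [x / d for x in m[col]]
        for r in range(n):
            if r != col:
                f = m[r][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [row[n:] for row in m]


def m2_scores(C, s, c_yy, q):
    """R*^2 for every subset S of size q, largest first.

    Uses R*^2 = (1/c_yy) * sum over j, k in S of s_j s_k (C^{-1})_{jk},
    which is Trace(ss'DC^{-1}D) / c_yy written out element by element.
    """
    C_inv = inverse(C)  # computed once and reused for every subset
    scores = []
    for S in combinations(range(len(s)), q):
        total = sum(s[j] * s[k] * C_inv[j][k] for j in S for k in S)
        scores.append((total / c_yy, S))
    return sorted(scores, reverse=True)


# Assumed input: test matrix one, read from the Table IV printout.
C = [[1.0, 0.5, 0.5],
     [0.5, 1.0, 0.1],
     [0.5, 0.1, 1.0]]
s = [0.6, 0.5, 0.5]

for q in (1, 2, 3):
    score, S = m2_scores(C, s, 1.0, q)[0]
    print(q, ["x%d" % (j + 1) for j in S], round(score, 5))
```

On this matrix the top-ranked subsets are $x_1$ for $q = 1$ and
the pair $x_2, x_3$ for $q = 2$, and the $q = 3$ score equals the
ordinary $R^2$ of the full set. Each additional subset costs
only a double sum over $q^2$ terms, which is the source of the
efficiency claimed for the method.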
V. RESULTS AND CONCLUSIONS


A. RESULTS
As was mentioned previously, the multiple correlation
coefficient is a convenient criterion for judging the
effectiveness of the combinations of variables selected by
the screening methods under discussion. Thus the results
of the screening methods can best be summarized with the
graphs of multiple correlation versus the number of variables.

The first test matrix, as mentioned, was designed to
expose a weakness in stepwise regression. Figure 1
indicates this quite well. The printout of the SPSS regression
program may be reviewed in Table IV. Variable $x_1$ was
selected first because it was highly correlated with $y$.
Stepwise regression then chose variable $x_3$. Variable $x_1$
remained significant, and was left in the equation. Stepwise
regression then selected variable $x_2$, found it significant,
and terminated. It never observed the pair $x_2, x_3$ alone,
which in this case was a much better pair than the one stepwise
regression chose at step two ($x_1, x_3$). Further, had stepwise
regression chosen $x_2, x_3$, it would not then have selected
$x_1$. This can be seen in Table IV. The reason for this is as
follows. The square of the multiple correlation ($R^2$) of
variables $x_1, x_3$ is a great deal smaller than $R^2$ for
$x_2, x_3$. Thus when stepwise regression considered adding
$x_2$ to the set $x_1, x_3$, the sizable increase in $R^2$
caused it to accept the new variable. On the other hand, when
it was given the
CORRELATION COEFFICIENTS

A VALUE OF 99.00000 IS PRINTED
IF A COEFFICIENT CANNOT BE COMPUTED.

           Y         X1        X2        X3
Y       1.00000   0.60000   0.50000   0.50000
X1      0.60000   1.00000   0.50000   0.50000
X2      0.50000   0.50000   1.00000   0.10000
X3      0.50000   0.50000   0.10000   1.00000

DEPENDENT VARIABLE..   Y

VARIABLE(S) ENTERED ON STEP NUMBER 1..   X1

MULTIPLE R        0.60000
R SQUARE          0.36000
STANDARD ERROR    0.80829

---------------- VARIABLES IN THE EQUATION ----------------
VARIABLE         B         BETA     STD ERROR B
X1            0.60000    0.60000     0.11547
(CONSTANT)    2.00000

VARIABLE(S) ENTERED ON STEP NUMBER 2..   X3

MULTIPLE R        0.64291
R SQUARE          0.41333
STANDARD ERROR    0.78207

---------------- VARIABLES IN THE EQUATION ----------------
VARIABLE         B         BETA     STD ERROR B       F
X1            0.46667    0.46667     0.12901       13.085
X3            0.26667    0.26667     0.12901        4.273
(CONSTANT)    1.33333

VARIABLE(S) ENTERED ON STEP NUMBER 3..   X2

MULTIPLE R        0.70238
R SQUARE          0.49333
STANDARD ERROR    0.73465

---------------- VARIABLES IN THE EQUATION ----------------
VARIABLE         B         BETA     STD ERROR B       F
X1            0.26667    0.26667     0.14210        3.522
X3            0.33333    0.33333     0.12368        7.263
X2            0.33333    0.33333     0.12368        7.263
(CONSTANT)    0.33333

MAXIMUM STEP REACHED

DEPENDENT VARIABLE..   Y

VARIABLE(S) ENTERED ON STEP NUMBER 1..   X3
                                         X2

MULTIPLE R        0.67420
R SQUARE          0.45455
STANDARD ERROR    0.75410

---------------- VARIABLES IN THE EQUATION ----------------
VARIABLE         B         BETA     STD ERROR B       F
X3            0.45455    0.45455     0.10827       17.625
X2            0.45455    0.45455     0.10827       17.625
(CONSTANT)    0.45455

VARIABLE(S) ENTERED ON STEP NUMBER 2..   X1

MULTIPLE R        0.70238
R SQUARE          0.49333
STANDARD ERROR    0.73465

---------------- VARIABLES IN THE EQUATION ----------------
VARIABLE         B         BETA     STD ERROR B       F
X3            0.33333    0.33333     0.12368        7.263
X2            0.33333    0.33333     0.12368        7.263
X1            0.26667    0.26667     0.14210        3.522
(CONSTANT)    0.33333

ALL VARIABLES ARE IN THE EQUATION

Table IV.






opportunity to add $x_1$ to the pair $x_2, x_3$, there was not
sufficient improvement, and $x_1$ was not added. Since stepwise
regression missed the best pair, it did not terminate until
all three were included, whereas it could have terminated
with two variables had it found the best pair.

Note also that while method M1 falls into the same trap
as stepwise regression, method M2 selected the same
combinations as did total enumeration for each $q = 1, 2, 3$.
The reader will also observe that the principal component
multiple correlation curve serves quite well in this graph,
and the two that follow, as a standard with which the other
methods may be compared.

It is interesting that this curve and the total
enumeration curve are generally very close in the cases presented
here, and in fact cross each other in the second case. This
observation is useful since in many applications the total
enumeration solution is unavailable.

Figure 2 indicates the results of the various methods
with the second test matrix. Note that while the $R^2$ curve
for method M2 runs well with the total enumeration and
stepwise regression curves (identical curves in this case), the
curve for method M1 runs consistently lower throughout the
entire midrange.

Figure 3 presents the results of the screening methods
with the third test matrix. This is of particular interest
because the results should be useful in the study cited
earlier. Note again that the curve for M2 runs quite strongly
with the curve for stepwise regression, but that the curve
for M1 falls consistently below the others. Throughout the
midrange of all three test cases, the curve for M1 is about
two-thirds of the total enumeration curve.

Because the second method shows greater promise as a
screening device, more detailed results on its performance
will be presented. Unlike stepwise regression, this method
is capable of ordering all combinations of size $q$ according
to their $R^{*2}$ values. (Stepwise regression will typically
only look at one or two of the combinations.) Thus, Table
V contains a summary of the combinations selected by the
second method, M2.

From this table, the graph in Figure 4 was constructed,
giving $R^{*2}$ as a function of the number of variables. This
curve seems fairly typical of what can be expected. Note
that as the number of variables allowed to enter increases,
$R^{*2}$ initially increases. At some point the curve will
peak, then decrease to the multiple correlation value of the
entire set of variables. This phenomenon can be explained in
the following way. The value of $t_{ij}$ (see (4.19), (4.20))
may be of either sign. When the number of variables is small,
it is fairly likely that some combination of variables will be
such that the $t_{ij}$ will combine with the same sign, so that
when squared and summed again the value may get quite large.
(In the three examples used in this paper, the value typically
exceeded one.) However, as the number of variables increases,
it becomes much more likely that the $t_{ij}$ of differing sign
will conflict with each other and reduce the overall values
being squared and summed. The curve in Figure 4 seems typical
of what may be expected. Thus at best these values are
pseudo-correlations, which have been shown to rank well with,
but not predict, the ranking of the actual multiple correlation
coefficients. As a result, this method cannot guarantee
optimal combinations.

Table V.
[Combinations selected by method M2 and their $R^{*2}$ values,
listed by q. Variable labels and most entries are not legible
in the source; the legible $R^{*2}$ values for q=3 are 1.13,
.871, .650, .627, .550, .518.]

Table VI gives the rankings of total enumeration and of
method M2 for each combination of size $q = 3$ and $q = 4$ from
test case two. In this way, the results of the method may
be more fully evaluated than by considering only the "best"
selection made at each step. Using the rankings given in
the table, Spearman's Rank Correlation test [Ref. 10] was
applied; at $\alpha = .1$ the rankings of total enumeration
were significantly correlated with the rankings of M2 in both
cases. This lends credence to the second screening method,
and also indicates a convenient property of this method: the
capability to rank all combinations at each step.
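
For readers who wish to repeat this kind of comparison, the
rank correlation itself is simple to compute. The Python sketch
below uses the standard formula for untied ranks,
$\rho = 1 - 6 \sum d_i^2 / (n(n^2 - 1))$. The two rankings in
the example are hypothetical and are not the entries of
Table VI; the function name `spearman_rho` is likewise an
illustrative choice.

```python
def spearman_rho(rank_a, rank_b):
    """Spearman's rank correlation for two rankings without ties."""
    n = len(rank_a)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


# Hypothetical rankings of five combinations by two methods.
pseudo_rank = [1, 2, 3, 4, 5]   # e.g. ranking by the pseudo-correlation
true_rank = [1, 3, 2, 4, 5]     # e.g. ranking by the ordinary R^2

print(spearman_rho(pseudo_rank, true_rank))
```

Identical rankings give $\rho = 1$ and exactly reversed rankings
give $\rho = -1$, so values near one indicate that the two
methods order the combinations in nearly the same way.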

Finally some comments on the use of principal components
are in order. It appears to provide a useful standard of
comparison for the other methods presented. Indeed, the
method may be of more direct use in some applications,
especially when researchers find that $p > N$. In such cases the
regression coefficients may not be directly estimable, but any
version of least squares will result in $\epsilon = 0$.
Principal components may be useful for preliminary screening
so that there are enough degrees of freedom $N - q$ to obtain
an acceptable estimate for variance of residuals.
Table VI.
Compared Rankings of the Second Method With Enumeration

[For each combination the table lists the variables, the
principal components value $R^{*2}$ with its rank, and the
multiple correlation $R^2$ with its rank. The individual
entries are not legible in the source.]

Q=3 (ten combinations):   Spearman's $\rho$ = 0.93
Q=4 (five combinations):  Spearman's $\rho$ = 0.80


B. CONCLUSIONS

On the basis of the results just presented, a number of 
observations concerning the screening methods under inves- 
tigation in this paper may be made. 

The first method (Ml) does not appear to perform par- 
ticularly well. Its selections were consistently worse 
than those made by the other methods. It may be that its 
disappointing performance is caused by the weak correlation
of the major principal components with the response variable.
As was seen in the three examples presented, this method 
does not in general even approach the optimal combinations, 
and thus shows little promise per se as a screening technique. 

Method M2 seems to be a reasonable approach to the prob- 
lem. While it will not in general make optimal selections, 
in the examples used here it has done quite well. 

The method has several advantages which are worthy of 
mention. The first is that after an initial investment in 
obtaining the eigenvalues and eigenvectors of C (roughly 
equivalent to inverting it, in time spent), the amount of 
time required to examine any potential combination is very
little. As a result of this, enumeration of the combinations
becomes feasible even when the number of variables is fairly
large. Appendix C contains some remarks concerning an algo- 
rithm which will enumerate quite efficiently, and which was 
used in the FORTRAN program presented there. This program 
outputs the largest value of $R^2$ and the combination that
produced it, for each value of q.







It is worth noting that the manner in which M2 screens 
variables is much more readily apparent than that of stepwise 
regression. The user is aware of the process by which variables
are screened, and is more able to make intelligent use
of it.


C. SUMMARY 

The results presented here are at best tentative, since
only three test cases are presented. Certainly a wider spec- 
trum of test cases is necessary to establish conclusive re- 
sults for the methods presented in this paper. Neither stepwise 
regression nor method M2 is optimal, but both remain as vi- 
able competitors which are useful as screening devices. 


Because of the success of M2, we are led to consider the
possibility of other methods of approximating the inverses
of the minors of C, which may be as efficient (with computer
resources) as those presented yet produce results which
compare more favorably with total enumeration.
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APPENDIX A: PRINCIPLE COMPONENTS 


The structure for principle components [Ref. 11] arose
from a desire to find that location and orientation of axes
such that the variance of the data swarm about them is minimized.
The necessary location of axes follows from a linear algebra
theorem which states that a sum of squares centered about an
arbitrary point is minimized when that point is the centroid.
Thus it becomes convenient to standardize the data
about their means. In order to define the desired orienta- 
tion, let us first agree to the following conventions. 

Let the data swarm under consideration be in p-space.
Let the X matrix contain the p coordinates of the N observations,
and let

$\bar{x}_j = \frac{1}{N}\sum_{i=1}^{N} x_{ij}, \qquad s_j^2 = \frac{1}{N}\sum_{i=1}^{N} (x_{ij} - \bar{x}_j)^2.$    (A.1)

Then let the X matrix have entries

$x_{ij} \leftarrow (x_{ij} - \bar{x}_j)/s_j.$    (A.2)

That is, let the X matrix contain the p coordinates of the
N observations after the variates have been standardized to
zero mean and unit variance. (Note that unit variance is
not necessary to what follows.) Then it follows that the
correlation matrix is

$C = X'X/N.$    (A.3)

The rotation itself may be denoted

$v = W'x$    (A.4)

where $v = (v_1, \ldots, v_p)'$, $x = (x_1, \ldots, x_p)'$, and $W_{p \times p}$ is
a matrix of column-oriented vectors, each of which provides the
coefficients for transforming the x vector into one of the
component directions.
For p=2, this may be displayed graphically (see Figure 5). 
As stated previously, the goal is to minimize the variance
about the component axes by choosing the matrix W. However,
we may do this neatly by setting W=0, and thus avoid the real
problem. To obtain a tenable solution we may constrain the
problem by requiring that each column w of W have norm one: w'w=1.
The variance along a component direction may be written

$S^2 = \frac{1}{N}(Xw)'(Xw) = w'\left(\frac{X'X}{N}\right)w = w'Cw.$    (A.5)


Because minimizing the variance about an axis is equivalent
to maximizing the variance along that axis, and because
$S^2$ represents the variance along the axes $V_j$, our problem
will be to maximize $S^2$. To show that this is true, refer to
Figure 6. Note that to minimize the variance in the $V_2$ direction
we must maximize along the (orthogonal) $V_1$ direction.
Orthogonality will be shown later.


Thus the problem of finding the direction $V_1$ may be
written

maximize w'Cw subject to w'w = 1    (A.6)

where w is an arbitrary column of W. This problem may be
solved conveniently by the use of LaGrange multipliers. The
LaGrangian is

$\phi_1 = w'Cw - \lambda(w'w - 1).$    (A.7)

Figure 5. Principle Component Rotation.

Figure 6. Maximizing the Variance Along $V_1$.


The LaGrangian is maximized at $\nabla\phi_1 = 0$:

$\nabla\phi_1 = 2(C - \lambda I)w = 0.$    (A.8)

Then

$(C - \lambda I)w = 0.$    (A.9)

Now the matrix C is real and symmetric, and will in general
be positive definite; it can be shown to be at least
positive semi-definite as follows: Let Z be any non-zero
vector, so that Z'CZ = Z'X'XZ/N = (XZ)'(XZ)/N. Now let Y = XZ,
an N×1 vector, so that

$(XZ)'(XZ) = Y'Y = \sum_{i=1}^{N} y_i^2 \ge 0.$

As a result, it can be shown [Ref. 9] that C is non-singular
with real, non-negative eigenvalues.

Recall that $(C - \lambda I)w = 0$. If $(C - \lambda I)$ is non-singular,
the only solution to the simultaneous equations is w=0,
which violates the constraint w'w=1. Thus we require that
$(C - \lambda I)$ be singular, and it follows that

$|C - \lambda I| = 0.$    (A.10)

Then, the set of p solutions $\lambda_i$ to this equation must be the
characteristic values of the matrix C. Pre-multiplying
equation A.9 by w' we obtain

$w'(C - \lambda I)w = w'0 = 0.$    (A.11)

Then

$w'Cw = w'\lambda Iw = \lambda w'w = \lambda.$    (A.12)





But w'Cw = $S^2$. Thus $\lambda$ is actually the variance along the component.
Since we wish to maximize the variance, the solution
we desire is the largest eigenvalue.

Because $\lambda$ is an eigenvalue, (A.9) implies that w is its
associated eigenvector, and thus we maximize $S^2$ by choosing
the eigenvector associated with the largest eigenvalue as the
direction $V_1$. Then call this eigenvector $w_1$ and its eigenvalue
$\lambda_1$.
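
The result that the largest eigenvalue is the maximal variance can be checked numerically. The following is a minimal sketch, assuming a 2×2 correlation matrix with a hypothetical correlation of 0.6. The function name `power_iteration` and the starting vector are illustrative choices, and power iteration is used only for convenience; it is not the JACVAT routine mentioned in Appendix C.

```python
import math


def power_iteration(C, iterations=200):
    """Approximate the largest eigenvalue of a symmetric positive-definite
    matrix C (a list of rows) and a unit-norm eigenvector for it."""
    p = len(C)
    # Arbitrary starting vector; assumed not orthogonal to the answer.
    w = [float(i + 1) for i in range(p)]
    for _ in range(iterations):
        y = [sum(C[i][j] * w[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(v * v for v in y))
        w = [v / norm for v in y]  # re-impose the constraint w'w = 1
    # With w'w = 1, the quadratic form w'Cw is the variance along w.
    variance = sum(w[i] * C[i][j] * w[j] for i in range(p) for j in range(p))
    return variance, w


C = [[1.0, 0.6], [0.6, 1.0]]  # hypothetical correlation matrix
lam1, w1 = power_iteration(C)
print(round(lam1, 6), [round(v, 6) for v in w1])
```

For this matrix the eigenvalues are 1 ± 0.6, so the sketch should return a variance close to 1.6 along the direction (1, 1)/√2, which exceeds the variance of 1 along either original axis.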
The direction $V_2$ may be found by solving the following problem:

maximize $w_2'Cw_2$ subject to $w_2'w_2 = 1$ and $w_1'w_2 = 0$.    (A.13)

The last constraint says that we require $V_1$ and $V_2$ to be
orthogonal. The LaGrangian is

$\phi_2 = w_2'Cw_2 - \theta(w_2'w_2 - 1) - 2\tau w_1'w_2$    (A.14)

and

$\nabla\phi_2 = 2Cw_2 - 2\theta w_2 - 2\tau w_1 = 0.$    (A.15)

Then

$(C - \theta I)w_2 - \tau w_1 = 0.$    (A.16)


Pre-multiplying by $w_2'$,

$w_2'(C - \theta I)w_2 - \tau w_2'w_1 = 0 \Rightarrow w_2'(C - \theta I)w_2 = 0 \Rightarrow$
$w_2'Cw_2 = \theta w_2'Iw_2 = \theta w_2'w_2 = \theta \Rightarrow S^2 = \theta.$    (A.17)

(It is shown below that $\theta$ is one of the eigenvalues of the C matrix, and $w_2$ its
associated eigenvector.) Now pre-multiply (A.16) by $w_1'$:

$w_1'(C - \theta I)w_2 - \tau w_1'w_1 = 0 \Rightarrow w_1'(C - \theta I)w_2 = \tau \Rightarrow$
$w_1'Cw_2 - \theta w_1'w_2 = \tau \Rightarrow w_1'Cw_2 = \tau.$    (A.18)

Pre-multiplying (A.9), written for $w_1$ and $\lambda_1$, by $w_2'$,

$w_2'(C - \lambda_1 I)w_1 = w_2'0 = 0 \Rightarrow w_2'Cw_1 = \lambda_1 w_2'w_1 \Rightarrow w_2'Cw_1 = 0.$    (A.19)

Then, because C is symmetric,

$w_1'Cw_2 = \tau = 0.$    (A.20)


Note that because $\tau$ turned out to be zero, the constraint
that forced the new component to be orthogonal to the first
component was not binding; that is, $w_2$ is inherently orthogonal
to $w_1$. This follows because C is a real, symmetric,
positive-definite matrix. Since $\tau = 0$, (A.16) becomes identical
in form to (A.9), so that $\theta$ is an eigenvalue, $w_2$ its
associated eigenvector. It was shown above that $\theta = S^2$ ($\theta$ is
the variance in the $V_2$ direction), so that it follows naturally
that $\theta$ is the second-largest eigenvalue (i.e., $\theta = \lambda_2$). Then
$w_2$ is the eigenvector associated with $\lambda_2$.
This argument can be generalized so that $\lambda_i$, the i-th
largest eigenvalue of C, is the variance in the i-th best
component direction $V_i$, and the associated eigenvector $w_i$ is
the set of coefficients mapping the x vector into $V_i$. Now
let $\Lambda_{p \times p}$ be a diagonal matrix with the ordered eigenvalues
(largest to smallest) in the diagonal. The rotation we desire
is then

$v = W'x,$

and the correlation matrix of the $v_i$ is $\Lambda$.
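
A small numerical check of this closing statement may be helpful. The sketch below assumes a hypothetical 2×2 correlation matrix with correlation 0.6, for which the unit eigenvectors are known in closed form; the helper names `matmul`, `transpose`, and `rotate` are illustrative only. It forms W'CW and shows that the rotated components are uncorrelated, with the eigenvalues on the diagonal.

```python
import math


def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def transpose(A):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]


def rotate(C, W):
    """Return W'CW, the second-moment matrix of the components v = W'x."""
    return matmul(matmul(transpose(W), C), W)


r = 0.6  # hypothetical correlation between x_1 and x_2
C = [[1.0, r], [r, 1.0]]
s = 1.0 / math.sqrt(2.0)
W = [[s, s], [s, -s]]  # columns are the unit eigenvectors w_1 and w_2
for row in rotate(C, W):
    print([round(v, 6) for v in row])
```

The diagonal entries should be approximately 1.6 and 0.4 (that is, 1 + r and 1 - r), and the off-diagonal entries approximately zero.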
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APPENDIX B: PROPERTIES OF THE CORRELATION MATRIX 


Throughout the course of this paper, and especially in 
the numerical examples, there has been an implicit assumption 
that if a matrix is real, symmetric, and positive definite,
it qualifies as a correlation matrix, and that it would be
possible to find real data that would generate such a correlation
matrix. While this is not overly difficult to show,
it is of general interest, and is not contained in many text- 
books as a complete derivation. 

The mathematical statement of the problem is to find a
matrix $X_{N \times p}$ such that $X'X/N = C_{p \times p}$, where C is given. It has
been shown previously that to qualify as the product of a
matrix and its transpose, C must be positive semi-definite.
(If C has full rank, it will be a positive definite matrix.)
Further, by the definition of correlation between two variates,
C must be real and symmetric.

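To find the X matrix, the initial step is to find a triangular
matrix $T_{p \times p}$ such that T'T = C. It is possible to
compute the elements of the T matrix by carrying out the
multiplication (shown in this case for a 3×3 matrix). Let T be
upper triangular, so that

$$
\begin{pmatrix} T_{11} & 0 & 0 \\ T_{12} & T_{22} & 0 \\ T_{13} & T_{23} & T_{33} \end{pmatrix}
\begin{pmatrix} T_{11} & T_{12} & T_{13} \\ 0 & T_{22} & T_{23} \\ 0 & 0 & T_{33} \end{pmatrix}
=
\begin{pmatrix} C_{11} & C_{12} & C_{13} \\ C_{12} & C_{22} & C_{23} \\ C_{13} & C_{23} & C_{33} \end{pmatrix}
$$

and

$$
\begin{pmatrix}
T_{11}T_{11} & T_{11}T_{12} & T_{11}T_{13} \\
T_{11}T_{12} & T_{12}T_{12} + T_{22}T_{22} & T_{12}T_{13} + T_{22}T_{23} \\
T_{11}T_{13} & T_{12}T_{13} + T_{22}T_{23} & T_{13}T_{13} + T_{23}T_{23} + T_{33}T_{33}
\end{pmatrix}
=
\begin{pmatrix} C_{11} & C_{12} & C_{13} \\ C_{12} & C_{22} & C_{23} \\ C_{13} & C_{23} & C_{33} \end{pmatrix}
$$

Thus

$T_{11} = \sqrt{C_{11}}, \qquad T_{12} = C_{12}/\sqrt{C_{11}}, \qquad T_{13} = C_{13}/\sqrt{C_{11}},$

$T_{22} = \sqrt{C_{22} - T_{12}^2} = \sqrt{C_{22} - C_{12}^2/C_{11}},$

$T_{23} = (C_{23} - T_{12}T_{13})/T_{22},$

$T_{33} = \sqrt{C_{33} - T_{13}^2 - T_{23}^2}.$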


Note that at every step it is possible to solve for $T_{ij}$
in terms of the $C_{ij}$ and only those other $T_{ij}$ for which it has
already been possible to solve. This may be done in general
for a C matrix of any size p.
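
The element-by-element solution just described generalizes directly to any size p. The following is a minimal Python sketch of that procedure, assuming C is symmetric and positive definite; the function name `upper_cholesky` and the 3×3 example matrix are hypothetical and are not taken from the test cases of this paper.

```python
import math


def upper_cholesky(C):
    """Return an upper-triangular T (list of rows) with T'T = C,
    assuming C is symmetric and positive definite."""
    p = len(C)
    T = [[0.0] * p for _ in range(p)]
    for i in range(p):
        # Diagonal element: T_ii^2 = C_ii - sum over k < i of T_ki^2.
        s = C[i][i] - sum(T[k][i] ** 2 for k in range(i))
        if s <= 0.0:
            raise ValueError("matrix is not positive definite")
        T[i][i] = math.sqrt(s)
        # Rest of row i: T_ij = (C_ij - sum over k < i of T_ki T_kj) / T_ii.
        for j in range(i + 1, p):
            cross = sum(T[k][i] * T[k][j] for k in range(i))
            T[i][j] = (C[i][j] - cross) / T[i][i]
    return T


# Hypothetical 3x3 correlation matrix.
C = [[1.0, 0.5, 0.3],
     [0.5, 1.0, 0.4],
     [0.3, 0.4, 1.0]]
for row in upper_cholesky(C):
    print([round(v, 4) for v in row])
```

Each element uses only the given $C_{ij}$ and elements of T from earlier rows, which is the property noted above.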


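Working backwards, we have that

$T'IT = C.$    (B.1)

Then let

$Z_{N \times p} = X_{N \times p}\,T^{-1}_{p \times p}.$    (B.2)

Then

$Z'Z/N = (T')^{-1}(X'X/N)\,T^{-1} = (T')^{-1}C\,T^{-1} = I,$    (B.3)

which is satisfied (in expectation) if the $z_{ij}$ are independent N(0,1), and we obtain the following transformation:

$X = ZT.$    (B.4)

Then

$X'X/N = T'(Z'Z/N)T = T'IT = T'T = C.$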
Thus having obtained a matrix Z containing N observations
on a p-variate standard normal, it is possible to transform
Z to a new set of N observations on a p-variate normal
(though no longer with unit variance) which will generate
the desired correlation matrix through the judicious choice
of a triangular matrix T. Thus the only requirements on a
matrix are that it be real, symmetric, and positive-definite
in order to be a correlation matrix [Ref. 9].
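
The transformation of standard normal observations by a triangular matrix can be illustrated for p=2, where T is available in closed form: its rows are (1, r) and (0, √(1-r²)), so that T'T has unit diagonal and off-diagonal element r. The sketch below is a hedged illustration only; the function names, the target correlation of 0.6, the sample size, and the seed are all hypothetical choices.

```python
import math
import random


def simulate_correlated(r, n, seed=0):
    """Return n rows of X = ZT, where each row of Z holds two independent
    N(0,1) draws and T has rows (1, r) and (0, sqrt(1 - r^2))."""
    rng = random.Random(seed)
    t22 = math.sqrt(1.0 - r * r)
    rows = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        rows.append((z1, r * z1 + t22 * z2))  # one row of Z times T
    return rows


def sample_correlation(rows):
    """Sample correlation coefficient of the two columns of rows."""
    n = len(rows)
    m1 = sum(a for a, _ in rows) / n
    m2 = sum(b for _, b in rows) / n
    s11 = sum((a - m1) ** 2 for a, _ in rows)
    s22 = sum((b - m2) ** 2 for _, b in rows)
    s12 = sum((a - m1) * (b - m2) for a, b in rows)
    return s12 / math.sqrt(s11 * s22)


X = simulate_correlated(0.6, 20000)
print(round(sample_correlation(X), 3))
```

Because Z'Z/N equals the identity only approximately for a finite sample, the sample correlation of the generated data is close to, but not exactly, the target value.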
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APPENDIX C: FORTRAN PROGRAM OF SECOND SCREENING METHOD FOR 
APPLICATION IN TEST CASE THREE

Because of the generally good results given by the second 
screening method suggested in this paper, the FORTRAN program 
which was used to apply it to the third matrix is listed here. 
The program does the following: 

1. It reads a correlation matrix (then reverses the last 
two rows and columns, as in this case X13 is to be the re- 
sponse variable) and also reads a standard deviation vector. 
From these it obtains the covariance matrix. (Lines 1 to 40, 
beginning the count with the dimension statement. ) 

2. It calls a subroutine (JACVAT) which calculates the
eigenvalues and eigenvectors of the covariance matrix. (Lines
See EO 510)

3. The equation developed under the first approach to
the second screening method is used to calculate the multiple
correlation coefficient. The printout, under the label
"R-square equals," and showing the sums and squares, may be
used to apply the screening method by hand. (Lines 5S0-/5.)

4. The covariance matrix's inverse is calculated for
transfer to the subroutine USER in order to use it with the
second approach to the second method. (Lines 74-90.)

5. The next part of the program is the driver program
for the subroutine TWIDDL (which provides the vector of ones
and zeroes). It provides the initial vector (Q), and performs
the one-zero exchange. (Lines 93 to 120.)
6. The TWIDDL subroutine is an algorithm originally
written in Algol for the Association for Computing Machinery,
and is listed as Algorithm 382 in the CACM. Essentially
it generates every possible combination of m ones and (n-m)
zeroes, but each new vector requires only the interchanging
of two positions in the previous vector [Ref. 12].

7. The USER subroutine uses the inverse of the covariance 
matrix, the indicator vector Q, and the augmented covariance 
matrix to screen variables using the second approach. For 
each value of m, it continually stores the largest value of
$R^2$ that it has calculated to date, and when m increments it
prints the value of $R^2$ and the vector Q which produced it.

(Using this program as a backbone, it is a straightfor- 
ward matter to obtain a program which will apply the first 


method to the data matrix.) 
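
The single-exchange enumeration used by the driver program can be illustrated with a short Python sketch. This is not the TWIDDL algorithm itself; it is a simpler recursive "revolving door" ordering, swapped in here because it has the same property that each new vector of m ones and (n-m) zeroes differs from the previous one by the interchange of two positions. The function name `revolving_door` is hypothetical.

```python
def revolving_door(n, m):
    """List every 0/1 vector of length n containing m ones, ordered so that
    consecutive vectors differ by exchanging a single one with a single zero."""
    if m < 0 or m > n:
        return []
    if m == 0:
        return [[0] * n]
    if m == n:
        return [[1] * n]
    # Vectors ending in 0 come first, followed (in reverse order) by the
    # vectors ending in 1; the reversal keeps every step a single exchange.
    head = [v + [0] for v in revolving_door(n - 1, m)]
    tail = [v + [1] for v in reversed(revolving_door(n - 1, m - 1))]
    return head + tail


# Enumerate all combinations of 2 variables out of 4.
for q in revolving_door(4, 2):
    print(q)
```

A screening routine could walk this list and update its quantities for only the two positions that changed at each step, rather than starting afresh for every combination. The recursion builds the whole list in memory, so it is suitable only for illustration with small n.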
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